Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Dec  5 23:20:29 2022"

The text continues here.

My thoughts:

I saw an email advertisement for this course and was very keen to join. I would like to learn more about data analytics; I have already studied some of it using Python and Jupyter Notebook, so studying R sounded very interesting. A couple of years ago I also took part in another course organized by Kimmo Vehkalahti (Johdatus Yhteiskuntatilastotieteeseen), and I must say it was all thanks to him that I finally started to understand something about statistics. That course only covers the basics, so I am hoping to learn more about statistics in this course. I am really looking forward to it.

Reflecting on my learning experience

I think using Exercise 1 together with the book makes it easier to go through the topics. I first tried to read just the book, but that gets a bit tedious; going through it together with Exercise 1 makes it more interactive. The syntax in R is somewhat similar to other programming languages, but it also has many differences, so currently I am a bit worried about whether I will remember all those functions and commands. Luckily, R for Health Data Science seems to be quite easy to browse, so it will not be a big problem to check some of the functions or syntax later.

For now I have only done the first four parts, which cover the basics, so there was nothing too difficult yet. I always struggle with join functions, and I probably will again, but R seems quite helpful here too, as you can see the results of the joined tables very easily.
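As a reminder to myself, a minimal dplyr join sketch (the two toy tables are invented for illustration, not course data):

```r
library(dplyr)

# Two toy tables sharing the key column `id` (invented for illustration)
students <- data.frame(id = c(1, 2, 3), name = c("Aino", "Ben", "Cai"))
grades   <- data.frame(id = c(2, 3, 4), points = c(25, 30, 28))

# inner_join() keeps only the rows whose key appears in both tables,
# so ids 1 and 4 are dropped
joined <- inner_join(students, grades, by = "id")
joined
```

Printing the result right away, as above, is an easy way to check that the join kept the rows you expected.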

GitHub repository:

https://github.com/Ankinen/IODS-project


Analysing learning2014 data using multiple regression

Describe the work you have done this week and summarize your learning.

date()
## [1] "Mon Dec  5 23:20:30 2022"

Read the learning2014 data and explore the structure and dimensions of it.

learning2014 <- read.table("data/learning2014.csv", sep=",", header=TRUE)
dim(learning2014)
## [1] 166   7
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

Dimensions and structure of the data

There are 166 observations and 7 variables in this data set. The variables are ‘gender’, ‘age’, ‘attitude’, ‘deep’, ‘stra’, ‘surf’, and ‘points’. The ‘deep’, ‘stra’, and ‘surf’ variables refer to students’ learning styles: deep, strategic, and surface learning.

We now use this data to explain the relationship between points (the dependent variable, or y) and age, attitude, deep, stra, and surf (the explanatory variables). For this we use multivariable (or multiple) linear regression.

Graphical overview

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

#graphical overview
plotmatrix  <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
plotmatrix

Summaries of the variables

summary(learning2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Linear Regression - explaining the plot and summaries

In linear regression we want the observed data to fit a descending or ascending line, because that would mean there most likely is some kind of correlation. When the line fits well, we see a linear relationship between the explanatory and dependent variables. If a straight line does not fit well, we need to consider an alternative model. (We will see some regression lines using the learning2014 data later.)

The observations should also be normally distributed. If this is the case, the residuals should show a normal distribution with a mean of zero. If we look at the histograms above, we can see that most of them are not normally distributed, but it is still too early to draw any definite conclusions about the data.

Here we just have a first glimpse of the data so that we can see what it all looks like.

The third assumption that we have in linear modelling is that the residuals are independent. This is more of a concern when doing time series data analysis.

The last assumption is that the residuals have equal variance. This means that the distance of the observations from the fitted line should be the same on the left and on the right.

From the plot matrix we can observe that there are some possible outliers and that gender does not seem to play a big role as both green and red dots are scattered more or less evenly.

From the summary table we can see the median and mean values of each variable, as well as their minimum and maximum values and the values at the 25% and 75% marks.

Let’s further check the relationship between points and three explanatory variables, attitude, stra and surf.

Regression model for points ~ attitude + stra + surf

# Fit a regression model where `points` is the target variable and `attitude`, `stra` and `surf` are the explanatory variables.
linearmodel <- lm(points ~ attitude + stra + surf, data = learning2014)

# print out the summary of the new model
summary(linearmodel)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Call:

Here we first see the function call, where points refers to the dependent variable and the variables after the ~ are the explanatory variables.

Residuals:

Next come the residuals. They tell the difference between the fitted line and the observations; here the median residual is 0.52. If our model fits well, the residual median should be close to zero, and Q1 and Q3 should be close to each other in magnitude (here they are quite near). The maximum and minimum values should also be near each other; however, in their case there might be some outliers affecting the results. As we can see from the residual boxplot, this indeed seems to be the case.

#residual boxplot

boxplot(resid(linearmodel))

Coefficients:

Estimates

The intercept represents the mean value of the expected response when all the predictor (explanatory) variables are equal to zero. The intercept is not always a sensible number, as sometimes a variable cannot have the value zero. In this case it does make some sense, because it is possible for the points to be zero. For the other features (the explanatory variables), the estimates give the estimated change in points when they change by one unit.

Standard Error The standard error, as the name suggests, is the standard error of our estimate. It allows us to construct marginal confidence intervals. Here you can find more information about the ways to handle confidence intervals in R.
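In R, confint() computes such intervals directly from a fitted model; a minimal sketch on the built-in mtcars data (mpg, wt, and hp are just illustrative stand-ins):

```r
# Illustrative model on R's built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

confint(fit)                # default 95% confidence intervals per coefficient
confint(fit, level = 0.99)  # wider 99% intervals
```

The same call, e.g. confint(linearmodel), works for the models fitted above.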

t-value The t-value tells us how far our estimated parameter is from a hypothesized value of 0, scaled by the standard deviation of the estimate.

P-value The p-value for each variable tests the null hypothesis that the coefficient is equal to zero (i.e., no effect). If this probability is low enough (5% or under, preferably well under), we can reject the null hypothesis. Here R helps us a bit and marks statistically significant p-values with asterisks. But remember not to trust this value blindly!
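These relationships can be verified by hand; a sketch on the built-in mtcars data (not the course data):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- summary(fit)$coefficients

# t value = estimate divided by its standard error
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# p-value = two-sided tail probability of that t statistic
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

all.equal(unname(t_manual), unname(coefs[, "t value"]))   # TRUE
all.equal(unname(p_manual), unname(coefs[, "Pr(>|t|)"]))  # TRUE
```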

The last part of the summary

The last part of the summary is for assessing the fit and overall significance of our model.

Residual Standard Error The residual standard error gives the standard deviation of the residuals.

Multiple R-squared

R-squared tells the proportion of the variance in the dependent variable that can be explained by the explanatory variables; it shows us how well the data fit our regression model. This figure does not give us any information about causation between the dependent and explanatory variables, and it does not indicate the correctness of the model. The higher the number, the more variability is explained. Clearly, our model does not explain the dependent variable very well. However, this number is also something to be cautious about: a high score can also indicate that our model is overfitting the data.
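Both the residual standard error and R-squared can be computed by hand from the residuals; a minimal sketch on the built-in mtcars data:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(resid(fit)^2)                       # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares

r_squared <- 1 - rss / tss           # proportion of variance explained
rse <- sqrt(rss / df.residual(fit))  # residual standard error

c(r_squared, summary(fit)$r.squared)  # the two entries match
c(rse, summary(fit)$sigma)            # the two entries match
```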

F-Statistic The F-test tells us whether the group of explanatory variables is jointly significant.

Removing variables that are not statistically significant

The previous summary showed us that only attitude had a significant relationship with the dependent variable. In the following model, only attitude is used.

# Fit a regression model where `points` is the target variable and `attitude` is the only explanatory variable.
linearmodelfitted <- lm(points ~ attitude, data = learning2014)

# print out the summary of the new model
summary(linearmodelfitted)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Compare the results with the previous model

We can see that there is not much change, and that the multiple R-squared value is actually lower. Based on this, the data does not seem to fit this simpler model any better.
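A more formal way to compare nested models is an F test with anova(); a sketch on built-in data (the same call would take the two models fitted above as its arguments):

```r
# Nested models on R's built-in mtcars data (illustrative only)
reduced <- lm(mpg ~ wt, data = mtcars)
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Tests whether the extra terms jointly improve the fit;
# a large p-value would favour keeping the simpler model
anova(reduced, full)
```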

Diagnostic Plots

In the plot() function we can select from six different diagnostic plots. Here we have chosen number 1 (Residuals vs Fitted values), number 2 (Normal Q-Q plot), and number 5 (Residuals vs Leverage).

The which argument selects the graphic:

  • 1: Residuals vs Fitted values
  • 2: Normal Q-Q plot
  • 3: Standardized residuals vs Fitted values
  • 4: Cook’s distances
  • 5: Residuals vs Leverage
  • 6: Cook’s distance vs Leverage

Residuals vs Fitted values When this scatterplot shows a good fit for the linear model, the points appear to be randomly spread around the line, with no apparent non-linear trends or indications of non-constant variance. The red line, which shows the average value of the residuals at each fitted value, is then nearly flat, telling us that there is no discernible non-linear trend in the residuals. The residuals should also be equally variable across the entire range of fitted values.

In the last model, where we only have attitude as the explanatory variable, the Residuals vs Fitted values plot at first sight seems quite OK. However, the red line starts to go up somewhere around a fitted value of 24, which indicates some non-constant variance in our errors. The points are also gathered more evenly from around 20 to 26; below 20 and above 26 the scatter is not as even. This means that we should not fully trust our confidence intervals, prediction bands, or the p-values in our regression.

Normal QQ-plot The QQ plot helps us assess whether the residuals are normally distributed. When the residuals match the diagonal line perfectly, they are normally distributed.

In our model we can see that the lower tail is heavier, i.e., it has larger (more extreme) values than we would expect under the standard modelling assumptions. The upper tail, on the other hand, is lighter, indicating smaller values than we would expect. This means that our residuals are not normally distributed.

Residuals vs Leverage One way of defining outliers is that they are points not approximated well by the model, meaning they have a large residual, which can significantly influence the model fit. The Residuals vs Leverage plot is a way to estimate which points are outliers.

In a perfectly fitting model, the Cook’s distance curves do not even appear on the plot: none of the points come close to having both a high residual and high leverage.

In our model, we have a few points with a very high residual and some with very high leverage.

It looks to me like there might be some outliers in the data that should be taken into consideration (perhaps deleted from the dataset?) to make the model fit the data better. The first scatterplots in the plot matrix also indicated the possibility of outliers, so this result of the analysis is no surprise.
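One way to flag candidate outliers numerically is Cook’s distance; a sketch on built-in data, using the common (though arbitrary) 4/n rule of thumb:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance measures how much each observation shifts the fitted model
d <- cooks.distance(fit)

# A common rule of thumb flags observations above 4/n for closer inspection
which(d > 4 / nrow(mtcars))
```

Flagged points should be inspected, not automatically deleted; they may be data errors or genuinely interesting observations.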

Our model definitely needs some further adjustment (at least) to be able to explain the dependent variable with the explanatory variables. However, the plots do not show anything so bad that it would directly indicate our explanatory variables are completely useless either.

par(mfrow = c(2,2))
plot(linearmodelfitted, which = c(1,2,5))

Additional sources:

  • https://boostedml.com/2019/06/linear-regression-in-r-interpreting-summarylm.html
  • https://dss.princeton.edu/online_help/analysis/interpreting_regression.htm
  • https://blog.minitab.com/en/adventures-in-statistics-2/how-to-interpret-regression-analysis-results-p-values-and-coefficients
  • https://www.statology.org/intercept-in-regression/#:~:text=The%20intercept%20(sometimes%20called%20the,model%20are%20equal%20to%20zero.


(more chapters to be added similarly as we proceed with the course!)

Logistic regression

date()
## [1] "Mon Dec  5 23:20:38 2022"

Bring in the libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(readr)
library(finalfit)
library(GGally)
library(boot)

Read the table to dataframe alc

alc <- read.table("data/alc.csv", sep=",", header=TRUE)
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
dim(alc)
## [1] 370  35
str(alc)
## 'data.frame':    370 obs. of  35 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

These data were collected using school reports and questionnaires in order to study student achievement in the secondary education of two Portuguese schools. The data attributes include student grades and demographic, social, and school-related features. The alc data combines two datasets: Mathematics (mat) and Portuguese language (por). The two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1, because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. (Source: https://archive.ics.uci.edu/ml/datasets/Student+Performance)

Analysis

The hypotheses:

  • h0 There is no relationship between the students’ alcohol consumption and absences, sex, family support, or mother’s education level.
  • h1 Alcohol consumption is higher among students who have a high number of absences.
  • h2 Alcohol consumption is higher among male than female students.
  • h3 Alcohol consumption is higher among male students whose mothers’ education level is low than among female students whose mothers’ education level is low.
  • h4 Alcohol consumption is higher among male students who do not receive family support than among female students who do not receive family support.
  • h5 Alcohol consumption is higher among male students with a high number of absences than among female students with a high number of absences.
  • h6 Alcohol consumption is higher among students who have a high number of absences, are male, have no family support, and whose mothers’ education level is low.

Check the summary statistics of the dataset grouped by sex

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use count mean_grade
##   <chr> <lgl>    <int>      <dbl>
## 1 F     FALSE      154       11.4
## 2 F     TRUE        41       11.8
## 3 M     FALSE      105       12.3
## 4 M     TRUE        70       10.3

Here we can see that there are 154 female students whose alcohol consumption is not high (i.e., combined workday and weekend alcohol consumption is not greater than 2). Their mean final grade was 11.4. There were 41 females whose alcohol consumption is defined as high, and their mean final grade was 11.8. For male students, there were 105 whose alcohol consumption was low, with a mean final grade of 12.3, and 70 whose alcohol consumption was high, with a mean final grade of 10.3.

From this it seems quite likely that, at least for male students, alcohol consumption affects school success. For females it seems to be the opposite, but the difference is much smaller, so the relationship does not seem as obvious.

Let’s check what the boxplot looks like:

# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))

# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("grade") + ggtitle("Student final grade distribution by alcohol consumption and sex")

From the boxplot we can see that for females with low alcohol consumption the variance in the final grade is larger, and even though their mean is a bit lower than for the females defined as high alcohol users, their overall grades reach higher. This might explain the slightly higher mean among females with high alcohol consumption.

We’ll take another variable, absences. First we check the counts of high alcohol use students and their mean absences, and then see what the distribution by alcohol consumption and sex looks like:

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absence = mean(absences))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use count mean_absence
##   <chr> <lgl>    <int>        <dbl>
## 1 F     FALSE      154         4.25
## 2 F     TRUE        41         6.85
## 3 M     FALSE      105         2.91
## 4 M     TRUE        70         6.1
# initialize a plot of high_use and absences
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")

We can see that the numbers show a pattern similar to the mean final grades: both females and males who are defined as high in their alcohol consumption also have a higher mean number of school absences. This would seem to confirm our hypothesis h1, “the alcohol consumption is higher with students who have high number of absences.”

Just out of curiosity, let’s take two more variables, Medu and famsup. Medu is the mother’s education level, a numeric value ranging from 0 to 4, where 0 is no education, 1 is primary education (4th grade), 2 is 5th to 9th grade, 3 is secondary education, and 4 is higher education. Famsup is family support; it has two values, yes and no, depending on whether the student receives educational support from the family.
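The 0-4 codes of Medu could be turned into readable labels with factor(); a sketch with toy codes and label strings that are my own wording of the scale described above:

```r
# Label strings are my own wording of the 0-4 scale described above
medu_labels <- c("none", "primary (4th grade)", "5th to 9th grade",
                 "secondary", "higher")

medu_codes <- c(0, 2, 4, 4, 1, 3)  # toy values, not from the alc data

# The codes are 0-based, so shift by one when indexing the label vector
medu <- factor(medu_labels[medu_codes + 1], levels = medu_labels)
table(medu)
```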

Let’s see the corresponding summaries first:

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_medu = mean(Medu))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use count mean_medu
##   <chr> <lgl>    <int>     <dbl>
## 1 F     FALSE      154      2.69
## 2 F     TRUE        41      2.76
## 3 M     FALSE      105      2.96
## 4 M     TRUE        70      2.81
alc %>% group_by(sex, high_use, famsup) %>% summarise(count = n())
## `summarise()` has grouped output by 'sex', 'high_use'. You can override using
## the `.groups` argument.
## # A tibble: 8 × 4
## # Groups:   sex, high_use [4]
##   sex   high_use famsup count
##   <chr> <lgl>    <chr>  <int>
## 1 F     FALSE    no        47
## 2 F     FALSE    yes      107
## 3 F     TRUE     no        11
## 4 F     TRUE     yes       30
## 5 M     FALSE    no        47
## 6 M     FALSE    yes       58
## 7 M     TRUE     no        34
## 8 M     TRUE     yes       36

From these numbers it seems that there are no big differences between the students whose alcohol consumption is high and those whose alcohol consumption is low. But let’s see the box plot for mother’s education level. Because family support is a binary variable, a box plot of it would not be very informative, so we do not draw one.
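For a binary variable like famsup, a cross-tabulation of proportions is more informative than a box plot; a minimal sketch with toy values (invented, not the real alc columns):

```r
# Toy stand-ins for alc$famsup and alc$high_use (invented values)
famsup   <- c("yes", "yes", "no", "no", "yes", "no")
high_use <- c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)

# Counts first, then row-wise proportions: the share of high users
# within each famsup group (margin = 1 normalises over rows)
counts <- table(famsup, high_use)
prop.table(counts, margin = 1)
```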

# initialize a plot of high_use and Medu
g3 <- ggplot(alc, aes(x = high_use, y = Medu, col = sex))

# define the plot as a boxplot and draw it
g3 + geom_boxplot() + ylab("Mother's education level") + ggtitle("Mother's education level by students alcohol consumption and sex")

We can, however, draw a box plot to see the relationship between alcohol consumption, family support, and absences, so let’s do that. This box plot is not directly comparable with the previous tables, but we can see that there is some relationship between these three variables too.

# initialize a plot of high_use and absences
g4 <- ggplot(alc, aes(x = high_use, y = absences, col = famsup))

# define the plot as a boxplot and draw it
g4 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and family support")

From the box plots we can already see that our hypothesis h3 (“the alcohol consumption is higher among male students whose mothers’ education level is low than female students whose mothers’ education level is low”) is not true. There is no significant difference between the male students whose alcohol consumption is high and the female students with high alcohol consumption. In fact, there is no difference between the male and female students whose alcohol consumption is low either. So it seems, as noted earlier, that mothers’ education level does not have any relationship with the students’ alcohol consumption.

There is some evidence supporting hypotheses h4 and h5, i.e., “the alcohol consumption is higher among male students who do not receive family support than female students who do not receive family support” and “the alcohol consumption is higher among male students with high absences number than female students with high absences number”, but it is not clear how significant this relationship is.

Finally, let’s see a summary. It appears that students whose alcohol use is defined as high have a higher number of absences, and they are likelier to be male. Students without family support are slightly more common among the high alcohol users, but that difference is not significant. And lastly, as we have already discovered, there is basically no relationship between mothers’ education level and alcohol use. I would normally drop mother’s education level from further analysis, but here I keep it just for the exercise’s sake. I would probably keep the family support variable but do further analysis to select the most parsimonious model that gives an adequate description of the data.

dependent <- "high_use"
explanatory <- c("absences", "sex", "famsup", "Medu")
alc %>% 
  summary_factorlist(dependent, explanatory, p = TRUE,
                     add_dependent_label = TRUE)
##  Dependent: high_use                FALSE      TRUE      p
##             absences Mean (SD)  3.7 (4.5) 6.4 (7.1) <0.001
##                  sex         F 154 (59.5) 41 (36.9) <0.001
##                              M 105 (40.5) 70 (63.1)       
##               famsup        no  94 (36.3) 45 (40.5)  0.512
##                            yes 165 (63.7) 66 (59.5)       
##                 Medu Mean (SD)  2.8 (1.1) 2.8 (1.1)  0.933

Logistic regression

Let’s explore the distributions of the chosen variables and their relationships with alcohol consumption. Let’s first check that the explanatory variables, i.e., sex, absences, family support, and mother’s education level, are not correlated with one another. The ggpairs() function is a good way of visualising all two-way associations.

explanatory <- c("absences", "famsup", "Medu")
alc %>% 
  remove_labels() %>%
  ggpairs(columns = explanatory, ggplot2::aes(color=sex))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can check for the presence of higher-order correlations with the variance inflation factor, which can be calculated for each of the terms. The variance inflation factor tells us how much the variance of a particular regression coefficient is increased due to the presence of multicollinearity in the model. GVIF stands for generalised variance inflation factor. According to the R for Health Data Science book, “a common rule of thumb is that if this is greater than 5-10 for any variable, then multicollinearity may exist. The model should be further explored and the terms removed or reduced”. None of the factors here is greater than 5-10, so there is no suggestion of multicollinearity.

dependent <- "high_use"
explanatory <- c("absences", "sex", "Medu", "famsup")
alc %>% 
  glmmulti(dependent, explanatory) %>%
  car::vif()
## absences      sex     Medu   famsup 
## 1.032858 1.064500 1.052280 1.053462

We use glm() to fit a logistic regression model with high_use as the target variable and absences, sex, Medu, and famsup as the predictors.

# find the model with glm()
m <- glm(high_use ~ absences + sex + Medu + famsup, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + sex + Medu + famsup, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2968  -0.8763  -0.6067   1.1127   2.0637  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.585134   0.382978  -4.139 3.49e-05 ***
## absences     0.099038   0.023618   4.193 2.75e-05 ***
## sexM         1.057917   0.249816   4.235 2.29e-05 ***
## Medu        -0.102692   0.113867  -0.902    0.367    
## famsupyes   -0.006895   0.251175  -0.027    0.978    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 414.82  on 365  degrees of freedom
## AIC: 424.82
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
##  (Intercept)     absences         sexM         Medu    famsupyes 
## -1.585134370  0.099037567  1.057917219 -0.102691836 -0.006895319
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.2049203 0.09470666 0.4265125
## absences    1.1041078 1.05649976 1.1590633
## sexM        2.8803656 1.77683802 4.7406281
## Medu        0.9024050 0.72142713 1.1285336
## famsupyes   0.9931284 0.60860371 1.6321229

From the coefficients we can see that it is safe to reject the null hypothesis h0 that there is no relationship between the students’ alcohol consumption and absences, sex, family support, and mother’s education level. It is clear that the number of absences and being male have a relationship with high alcohol consumption. This also seems to confirm our hypothesis h2, “the alcohol consumption is higher among male than female students”.

We already discarded hypothesis h3, “the alcohol consumption is higher among male students whose mothers’ education level is low than female students whose mothers’ education level is low”, and it seems that hypothesis h4, “the alcohol consumption is higher among male students who do not receive family support than female students who do not receive family support”, is not true either. For both variables, family support and mother’s education level, the odds ratio is close to 1, which suggests that these variables have no effect, or only a very limited effect, on students’ alcohol consumption. This is further supported by the confidence intervals, as the CIs of both variables include 1.

The model seems to confirm hypothesis h5, that alcohol consumption is higher among male students with a high number of absences than among female students with a high number of absences: the coefficients for male sex and absences are highly significant. This is further supported by the odds ratios and CIs of these variables, although the odds ratio for absences is only slightly above 1.

The last hypothesis h6, “the alcohol consumption is higher among students who have a high number of absences, who are male, have no family support and whose mothers’ education level is low”, is also not true, as family support and mother’s education level do not have a significant relationship with students’ alcohol consumption.

Here is an odds ratio plot which may make the above clearer. An odds ratio of 1 represents the null effect, so if a variable’s horizontal line crosses this value, the result is not statistically significant.

dependent <- "high_use"
explanatory_multi <- c("absences", "sex", "Medu", "famsup")
alc %>% 
  or_plot(dependent, explanatory_multi,
          breaks = c(0.5, 1, 2, 5, 10, 25),
          table_text_size = 3.5,
          title_text_size = 16)
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Warning: Removed 2 rows containing missing values (`geom_errorbarh()`).

The Predictive power of the model

Let’s check how well our model actually predicts the target variable. We use the predict() function to make predictions with a model object, and its ‘type’ argument to get the predictions as probabilities (instead of log-odds, which is the default).

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# look at the first ten observations of the data, along with the predictions
select(alc, failures, absences, sex, high_use, probability, prediction) %>% head(10)
##    failures absences sex high_use probability prediction
## 1         0        5   F    FALSE   0.1823191      FALSE
## 2         0        3   F    FALSE   0.1981958      FALSE
## 3         2        8   F     TRUE   0.2899708      FALSE
## 4         0        1   F    FALSE   0.1296836      FALSE
## 5         0        2   F    FALSE   0.1542003      FALSE
## 6         0        8   M    FALSE   0.4619290      FALSE
## 7         0        0   M    FALSE   0.3246243      FALSE
## 8         0        4   F    FALSE   0.1670547      FALSE
## 9         0        0   M    FALSE   0.3010742      FALSE
## 10        0        0   M    FALSE   0.3010742      FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   251    8
##    TRUE     84   27

The cross tabulation of the predictions versus the actual values (i.e., the confusion matrix) shows that the precision of this model was 27/35, i.e., around 77%, and the recall was 27/111, i.e., around 24%.

The F1 score of the model is: True Positives / (True Positives + (False Negatives + False Positives) / 2), which in our case is around 37%. This means that our model does not work very well at all.
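As a sanity check, these metrics can be computed directly from the confusion-matrix counts above (a small sketch; the counts are taken from the table):

```r
# counts from the confusion matrix above
TP <- 27; FP <- 8; FN <- 84; TN <- 251

precision <- TP / (TP + FP)               # 27/35  ~ 0.77
recall    <- TP / (TP + FN)               # 27/111 ~ 0.24
f1        <- 2 * TP / (2 * TP + FP + FN)  # ~ 0.37
c(precision = precision, recall = recall, f1 = f1)
```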

If we compare with the numbers in Exercise 3, this model is slightly worse. Compared with simple guessing, where guessing that someone has high alcohol consumption would be right 50% of the time, our model is worse than just a guess.

Simple measure of performance

Let’s check the penalty function of our model; the lower the penalty, the better the model. The loss function of logistic regression shows the loss (the penalty) if we always predict 0 (the first calculation) or always predict 1 (the second calculation). Here we can see that the cost of always predicting 0 is 0.3, which is not too bad, but the cost of always predicting 1 is 0.7, which is not good at all.

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2486486

Bonus: 10-Fold Cross-Validation

cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2567568

This model seemed to do slightly better than the model in Exercise 3. However, it still did not perform very well. If we want a model with better performance, there are plenty of ways to get one. It is also a good idea to spend more time selecting the variables. For example, you can calculate the R-squared for the last variable added to the model. This tells you how much the R-squared increased with each variable, i.e. the improvement in the goodness of fit. (See, for example, https://statisticsbyjim.com/regression/identifying-important-independent-variables/)
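For logistic models specifically, nested models can also be compared with a likelihood-ratio test via anova(). This is a sketch, assuming the alc data and the fitted model m from above are available:

```r
# sketch: does adding Medu and famsup improve on absences + sex alone?
m_reduced <- glm(high_use ~ absences + sex, family = "binomial", data = alc)
anova(m_reduced, m, test = "Chisq")  # likelihood-ratio test for the extra terms
```

A large p-value here would support dropping Medu and famsup from the model.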


Clustering and Classification

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The data was originally published by Harrison, D. and Rubinfeld, D.L., ‘Hedonic prices and the demand for clean air’, J. Environ. Economics & Management, vol. 5, 81-102, 1978. The dataset contains a total of 506 cases, each with 14 attributes.

Variables in order:

  1. CRIM - per capita crime rate by town
  2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS - proportion of non-retail business acres per town.
  4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. NOX - nitric oxides concentration (parts per 10 million)
  6. RM - average number of rooms per dwelling
  7. AGE - proportion of owner-occupied units built prior to 1940
  8. DIS - weighted distances to five Boston employment centres
  9. RAD - index of accessibility to radial highways
  10. TAX - full-value property-tax rate per $10,000
  11. PTRATIO - pupil-teacher ratio by town
  12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT - % lower status of the population
  14. MEDV - Median value of owner-occupied homes in $1000’s
# libraries
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(GGally)
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded

Load the dataset and check how the data looks

Load the Boston dataset

# load the data
data("Boston")

Explore the Dataset

Structure:

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Dimensions:

dim(Boston)
## [1] 506  14

Summary of the dataset:

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

From the summary we can notice that many variables have quite interesting values. This is because many of the variables are either dummy variables (like chas) or depict proportions (like zn and indus). This is nothing to worry about, but something to keep in mind when analysing some of the numbers.

Graphical overview of the data

https://imjbmkz.medium.com/analyzing-boston-housing-dataset-2c7928f2a87f https://rstudio-pubs-static.s3.amazonaws.com/388596_e21196f1adf04e0ea7cd68edd9eba966.html

# plot matrix of the variables
boston_plot <- pairs(Boston[6:10])

From the plot matrix we can see that most of the variables seem to have some sort of relationship with each other. This suggests that we should consider these variables together using multivariate analysis.

Boxplot matrix

boxplot(as.data.frame(Boston))

By observing the boxplots, we can see that the box plot for tax is relatively tall: the values of this variable are much bigger, and the difference between its minimum and maximum is larger, than for the other variables. This is also why the tax box plot sits much higher than the other plots. Because tax stretches the scale, it is better to check the box plots of the other variables separately.

boxplot(as.data.frame(Boston[1:9]))

Now we can see that the variables crim and zn seem to have quite a lot of outliers. Chas is a special variable because it is categorical, taking the value 1 if the tract bounds the river and 0 if it doesn’t. For zn and rad the median is much closer to the minimum value than to the maximum. Let’s check the four last variables: ptratio, black, lstat and medv.


boxplot(as.data.frame(Boston[11]))

boxplot(as.data.frame(Boston[12]))

boxplot(as.data.frame(Boston[13:14]))

From the boxplots we can see that the variables black and medv also seem to have a large number of outliers. Let’s check the outliers that lie outside the interval formed by the 1st and 99th percentiles. We could do the same for the other variables where we noticed possible outliers and then decide whether to leave them in or remove them from the dataset. Here we keep them, bearing in mind that these outliers might affect our results in the end.

lower_bound <- quantile(Boston$black, 0.01)
upper_bound <- quantile(Boston$black, 0.99)

outlier_ind <- which(Boston$black < lower_bound | Boston$black > upper_bound)

Boston[outlier_ind, ]
##         crim zn indus chas   nox    rm   age    dis rad tax ptratio black lstat
## 411 51.13580  0  18.1    0 0.597 5.757 100.0 1.4130  24 666    20.2  2.60 10.11
## 424  7.05042  0  18.1    0 0.614 6.103  85.1 2.0218  24 666    20.2  2.52 23.29
## 425  8.79212  0  18.1    0 0.584 5.565  70.6 2.0635  24 666    20.2  3.65 17.16
## 451  6.71772  0  18.1    0 0.713 6.749  92.6 2.3236  24 666    20.2  0.32 17.44
## 455  9.51363  0  18.1    0 0.713 6.728  94.1 2.4961  24 666    20.2  6.68 18.71
## 458  8.20058  0  18.1    0 0.713 5.936  80.3 2.7792  24 666    20.2  3.50 16.94
##     medv
## 411 15.0
## 424 13.4
## 425 11.7
## 451 13.4
## 455 14.9
## 458 13.5
# source: https://statsandr.com/blog/outliers-detection-in-r/
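The same percentile check could be wrapped in a small helper function for inspecting other variables too (a sketch; the function name is my own):

```r
library(MASS)  # for the Boston data

# hypothetical helper: indices of values outside the 1st-99th percentile interval
percentile_outliers <- function(x, lower = 0.01, upper = 0.99) {
  bounds <- quantile(x, c(lower, upper))
  which(x < bounds[1] | x > bounds[2])
}

percentile_outliers(Boston$black)  # the rows listed above
```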
# Correlation Matrix
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)

# print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.7)

Standardize the dataset and create a new variable, crime rate

Standardize the dataset

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables with standardized variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# check that we succeeded and we have a dataframe
class(boston_scaled)
## [1] "data.frame"

With standardization we bring all the variables onto the same scale, so that the mean is always 0. This is especially useful in multivariate analysis and particularly in clustering: variables with very different scales can distort the calculations, since the idea is to compute the distances between each pair of the n individuals (the rows in a data frame).
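As a quick illustration, scale() is equivalent to subtracting each column’s mean and dividing by its standard deviation (a sketch using the crim column):

```r
library(MASS)  # for the Boston data

# manual z-score of crim matches scale()
z_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(z_crim, as.numeric(scale(Boston$crim)))  # TRUE
```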

Categorical variable of crime rate

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

First we check what the scaled crime rate variable looks like, to decide into how many parts we want to divide it. It seems reasonable to use the quantile division. Now that we have this categorical variable, which works better when creating classes, we can also get rid of the original crim variable.

Divide the dataset to train set and test set

To have any idea of how well our classification works, we need to divide the dataset into two parts. The train set is 80% of the dataset, and we use it to train our model. The remaining 20% we use as a test set to see how well our model actually works. Because we don’t want the test set to know the right answers beforehand, we remove the crime variable from the test set; we use it later to evaluate how well we managed to classify the test data with our model.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
n
## [1] 506
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data [part of task 6]
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

Use the Linear Discriminant Analysis to train our data

First we fit our data to the model. The lda function takes a formula (like in regression) as a first argument. We use the crime as a target variable and all the other variables as predictors.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2747525 0.2252475 0.2450495 0.2549505 
## 
## Group means:
##                  zn      indus        chas        nox          rm        age
## low       0.9796600 -0.9214487 -0.09498221 -0.8895954  0.42805959 -0.8929406
## med_low  -0.1100627 -0.3138224 -0.01274004 -0.5636249 -0.12031515 -0.2038284
## med_high -0.3737674  0.1638884  0.08558914  0.3982944  0.01550043  0.3653322
## high     -0.4872402  1.0170891 -0.04298342  1.0798472 -0.41026005  0.8138718
##                 dis        rad        tax     ptratio       black       lstat
## low       0.9115106 -0.6869945 -0.7342636 -0.39808849  0.37579277 -0.75557195
## med_low   0.2841526 -0.5590839 -0.4923806 -0.05001551  0.32273678 -0.10214684
## med_high -0.3798720 -0.4157582 -0.3228302 -0.31679163  0.05131457  0.03227421
## high     -0.8540780  1.6384176  1.5142626  0.78111358 -0.79957658  0.89083393
##                 medv
## low       0.50570021
## med_low  -0.02340127
## med_high  0.13108230
## high     -0.72276766
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.12374801  0.634350976 -0.69272313
## indus    0.02769081 -0.328314479  0.17946008
## chas    -0.02208510  0.006354361  0.13789517
## nox      0.44941191 -0.606666844 -1.64832292
## rm       0.01476024  0.034284190 -0.08109494
## age      0.21741582 -0.259993622  0.24636909
## dis     -0.06870186 -0.097521333 -0.03465966
## rad      3.07591045  1.061854624 -0.09703329
## tax      0.06320096 -0.131105027  0.60983810
## ptratio  0.14826575  0.078704297 -0.23765293
## black   -0.10909045  0.011954161  0.12106973
## lstat    0.21376251 -0.241551323  0.32554633
## medv     0.09301365 -0.443080945 -0.29983002
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9503 0.0382 0.0115
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Make predictions with our test data

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13       2        1    0
##   med_low   11      18        6    0
##   med_high   1       9       16    1
##   high       0       0        0   24

I would say that, without doing anything else than just scaling the dataset, this is not too bad. The model predicts correctly all the highs (24/24), about 59% of the med_highs (16/27), 51% of the med_lows (18/35) and 81% of the lows (13/16). Better than just guessing!
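These per-class proportions, and the overall accuracy, can also be read off programmatically; a sketch assuming correct_classes and lda.pred from above are available:

```r
# overall accuracy: share of test observations classified correctly
mean(correct_classes == lda.pred$class)

# row-wise proportions: share of each true class assigned to each predicted class
prop.table(table(correct = correct_classes, predicted = lda.pred$class), margin = 1)
```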

Calculate the distances

Let’s use the scaled dataset to calculate the distances between each pair of observations. First, we use the Euclidean distance measure, which is the straight-line distance between two data points. Second, we use the Manhattan distance, which sums the absolute differences along each dimension of an N-dimensional vector space. Manhattan distance is often preferred over the Euclidean metric when the dimensionality of the data increases.

Source: https://medium.com/@kunal_gohrani/different-types-of-distance-metrics-used-in-machine-learning-e9928c5e26c7
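A tiny example of the difference between the two metrics, using the classic 3-4-5 triangle:

```r
# two points: (0, 0) and (3, 4)
pts <- rbind(c(0, 0), c(3, 4))
dist(pts)                         # Euclidean: sqrt(3^2 + 4^2) = 5
dist(pts, method = "manhattan")   # Manhattan: |3| + |4| = 7
```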

boston_scaled_new <- as.data.frame(scale(Boston))

# euclidean distance matrix
dist_eu <- dist(boston_scaled_new)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled_new, method = "manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

K-means algorithm

First we set the number of clusters to 3 and see how the clustering works, using pairs() as the visualization tool:

# k-means clustering
km <- kmeans(boston_scaled_new, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled_new[6:10], col = km$cluster)

Then we investigate what would be the optimal number of clusters and run the algorithm again.

# Work with the exercise in this chunk, step-by-step. Fix the R code!
# MASS, ggplot2 and Boston dataset are available
set.seed(123)

# determine the number of clusters
k_max <- 10
k_max
## [1] 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.

# k-means clustering
km <- kmeans(Boston, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)

After inspecting the results using the total within-cluster sum of squares (WCSS), we can see from the plot that the WCSS drops sharply at around 2 clusters, so we choose 2 as the optimal number of clusters for this model. When we plot the Boston dataset with the clusters, we use the same variables as before to make it easier to see the differences. Comparing the plot with three clusters to the one with two, the two clusters seem to form groups with less overlap than in the three-cluster plot.

Bonus tasks

Perform k-means on the standardized Boston data with some reasonable number of clusters (> 2)

set.seed(13)
# k-means clustering
km_boston <- kmeans(boston_scaled_new, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled_new[6:10], col = km_boston$cluster)

Perform LDA using the clusters as target classes.

All the variables in the Boston data are included in the LDA model. After fitting the data to the model, we visualize the results with a biplot.

boston_scaled_bonus <- as.data.frame(scale(Boston))
# create a categorical variable 'crime' and remove crim
bins <- quantile(boston_scaled_bonus$crim)
crime <- cut(boston_scaled_bonus$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# add the new categorical value to scaled data
boston_scaled_bonus <- data.frame(boston_scaled_bonus, crime)

boston_scaled_bonus <- dplyr::select(boston_scaled_bonus, -crim)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = boston_scaled_bonus)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = boston_scaled_bonus)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2509881 0.2490119 0.2490119 0.2509881 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       0.96399896 -0.9118363 -0.11732512 -0.8783001  0.4440813 -0.8853416
## med_low  -0.09675198 -0.2910375 -0.02235445 -0.5655031 -0.1339316 -0.3391655
## med_high -0.38379059  0.1853483  0.19637334  0.3869312  0.1006223  0.4124508
## high     -0.48724019  1.0166933 -0.05532354  1.0554659 -0.4110343  0.8126334
##                 dis        rad        tax     ptratio      black      lstat
## low       0.8777688 -0.6897808 -0.7381779 -0.44318464  0.3787346 -0.7705123
## med_low   0.3516162 -0.5480059 -0.4770689 -0.06194383  0.3227717 -0.1325955
## med_high -0.3764265 -0.4121951 -0.3080609 -0.28336515  0.0872293  0.0127299
## high     -0.8531538  1.6424211  1.5171256  0.78577465 -0.7855072  0.8894341
##                  medv
## low       0.525956037
## med_low  -0.002531505
## med_high  0.170832244
## high     -0.692931573
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.101362020  0.71021035 -0.94401212
## indus    0.027739389 -0.25722619  0.28582296
## chas    -0.071824244 -0.05644265  0.08605965
## nox      0.368585037 -0.73079397 -1.36969686
## rm      -0.079177768 -0.07808680 -0.16397501
## age      0.251832283 -0.31611016 -0.14279184
## dis     -0.072503180 -0.26487036  0.16558054
## rad      3.251389429  0.93949138 -0.06819986
## tax      0.008751469 -0.01233004  0.56822607
## ptratio  0.122582335  0.02144323 -0.28114348
## black   -0.126522553  0.02006734  0.13121934
## lstat    0.209706055 -0.22672353  0.37709356
## medv     0.169105049 -0.39371422 -0.19149321
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9512 0.0366 0.0123
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(boston_scaled_bonus$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

From the biplot we can see that rad explains more of the high crime rates, zn the low and med_low rates, and nox more of the med_high rates. However, only rad is influential enough to separate the high crime class very clearly, although in this model some med_highs are still included there too.

Super-Bonus


Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.

The code matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling did not work. This page: https://stats.stackexchange.com/questions/82497/can-the-scaling-values-in-a-linear-discriminant-analysis-lda-be-used-to-plot-e says: “data.lda <- data.frame(varnames=rownames(coef(iris.lda)), coef(iris.lda)) #coef(iris.lda) is equivalent to iris.lda$scaling”. This was done for the iris dataset, but the idea is the same, so I changed lda.fit$scaling to coef(lda.fit).

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% coef(lda.fit)
matrix_product <- as.data.frame(matrix_product)

# Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
# Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities? (0-3 points to compensate any loss of points from the above exercises)
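A sketch of the two adjusted calls the comment asks for; it assumes matrix_product and train from the chunks above, and that the k-means fit from the earlier clustering exercise is stored in an object called km (a hypothetical name here):

```r
library(plotly)

# color the points by the crime classes of the train set
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers', color = train$crime)

# color the points by the k-means clusters; $cluster is coerced to a factor
# so that plotly uses a discrete color scale
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers', color = as.factor(km$cluster))
```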

Dimensionality Reduction Techniques

date()
## [1] "Mon Dec  5 23:20:57 2022"

The data used in the first part is modified from the UNDP Human Development Reports' Human Development Index (HDI) dataset. The HDI is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable, and having a decent standard of living. The modified dataset also includes some measures of gender equality.

More information about the Human Development Index can be found at https://hdr.undp.org/data-center/human-development-index#/indicies/HDI

In the dataset the country name is used as the row name.

Here we use the following variables:

  1. Life_Exp = Life Expectancy at Birth
  2. Exp_Edu = Expected Years of Education for children entering school
  3. GNI_Capita = Gross National Income (GNI) per Capita
  4. Mater_Mort = Maternal Mortality Ratio
  5. Ado_Birth = Adolescent Birth Rate
  6. Parli_F = Percentage of female representatives in parliament
  7. Edu2_ratio = Ratio of the female and male populations with secondary education in each country (Female/Male)
  8. Labor_ratio = Ratio of female and male labor force participation in each country (Female/Male)
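The two ratio variables were computed in the data wrangling step roughly like this. A toy sketch with made-up values for two imaginary countries; the source column names (Edu2.F, Edu2.M, Labo.F, Labo.M) are assumptions based on the original HDI data:

```r
library(dplyr)

# made-up values: country 1 resembles a high-HDI country, country 2 a low-HDI one
hd <- data.frame(Edu2.F = c(98.5, 27.0), Edu2.M = c(97.8, 45.0),
                 Labo.F = c(55.1, 15.2), Labo.M = c(68.3, 79.5))

hd <- hd %>%
  mutate(Edu2_ratio  = Edu2.F / Edu2.M,   # female/male secondary education
         Labor_ratio = Labo.F / Labo.M)   # female/male labor force participation
```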

Bring in the libraries

library(dplyr)
library(readr)
library(tidyr)
library(corrplot)
library(GGally)
library(ggplot2)
library(FactoMineR)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Load the data

human <- read.table("data/human.csv", sep=",", header=TRUE)

Check what the data looks like

colnames(human)
## [1] "Edu2_ratio"  "Labor_ratio" "Life_Exp"    "Exp_Edu"     "GNI_Capita" 
## [6] "Mater_Mort"  "Ado_Birth"   "Parli_F"
dim(human)
## [1] 155   8
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2_ratio : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labor_ratio: num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life_Exp   : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Exp_Edu    : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI_Capita : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mater_Mort : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado_Birth  : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli_F    : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

Graphical overview of the data

ggpairs(human)

Summaries of the variables in the data

summary(human)
##    Edu2_ratio      Labor_ratio        Life_Exp        Exp_Edu     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##    GNI_Capita       Mater_Mort       Ado_Birth         Parli_F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
cor(human) %>% corrplot()

The outputs confirm what we would expect. The education ratio is positively correlated with life expectancy, GNI per capita and (rather obviously) expected education. In countries where many women are educated, people live longer, study more and earn more. Those countries also have lower maternal mortality and adolescent birth rates.

Life expectancy is positively correlated with GNI per capita: people in richer countries live longer. Life expectancy is negatively correlated with the maternal mortality and adolescent birth rates. Since the maternal mortality and adolescent birth rates are positively correlated with each other, it seems natural that when these two rates are high, life expectancy is lower.

Expected education is positively correlated with GNI per capita: in richer countries people study longer, but also when people study longer they earn more. High expected education is also associated with lower maternal mortality and adolescent birth rates. When women stay in school longer, they have fewer births and have babies later, both of which affect maternal mortality.

In countries with high GNI per capita, both the maternal mortality rate and the adolescent birth rate are lower. This is also something you would expect: high-income countries have more means for good public healthcare, which lowers maternal mortality, and women who give birth later in life usually have better education, so they also earn more.

The distributions of some variables vary enormously between countries. For example, the education ratio is over 8 times larger in the countries with the highest values than in those with the lowest; what is more, the highest value indicates that more women than men participate in secondary education. GNI per capita, the maternal mortality ratio and the adolescent birth rate tell a story of how unequal this world is. While the lowest GNI per capita is just under 600, the highest is over 200 times higher, at 123,124. In some countries only 1 woman dies in childbirth while in others over 1,100 perish, and while in some countries the adolescent birth rate is less than 1, in others it is over 200.
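The spreads quoted above can be verified directly from the min/max values shown by summary(human):

```r
# ratio of the highest to the lowest observed value for two of the variables,
# using the min/max figures from summary(human)
edu2_spread <- 1.4967 / 0.1717   # education ratio: roughly 8.7-fold
gni_spread  <- 123124 / 581      # GNI per capita: roughly 212-fold
round(c(edu2 = edu2_spread, gni = gni_spread), 1)
```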

Principal Component Analysis

Perform the analysis on the non-standardized data

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Perform the PCA on standardized data

# standardize the variables
human_std <- scale(human)

# print out summaries of the standardized variables
summary(human_std)
##    Edu2_ratio       Labor_ratio         Life_Exp          Exp_Edu       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##    GNI_Capita        Mater_Mort        Ado_Birth          Parli_F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# perform principal component analysis (with the SVD method)
pca_human_std <- prcomp(human_std)

summary(pca_human_std)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

human2 <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt", 
                    sep =",", header = T)
human2_std <- scale(human2)
summary(human2_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
pca_human2 <- prcomp(human2_std)
biplot(pca_human2, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

I added here the plot from Exercise 5, where the country variable is still in use. It is not much clearer (maybe the opposite), but we can see the names of some of the countries, so it might be helpful.

The difference is mainly explained by the different units of the variables. While the ratios are very small, usually around 1, GNI per capita reaches values over 100,000. If we use the covariance matrix, as we did in the first version of the PCA, the variables with the largest variances dominate the early components. This is why the GNI_Capita variable stands alone in the first biplot while all the other variables are clustered very close together. It is also clear from the summary of the first principal component analysis, according to which PC1 explains nearly all of the variance.

When we use the correlation matrix instead, we standardize the variables to unit variance, meaning that all the variables are equally important. We can see this in the summary table: the first component explains a little over 50% of the variance, and we need two components to explain just under 70% of it.
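The same correlation-matrix PCA can be done in one step by letting prcomp() do the scaling with its scale. argument. A small sketch of the covariance vs. correlation difference, using the built-in mtcars data instead of human:

```r
# covariance-based PCA: the variable with by far the largest variance (disp)
# dominates the first component
pca_cov <- prcomp(mtcars)

# correlation-based PCA: scale. = TRUE standardizes the variables first,
# which is equivalent to prcomp(scale(mtcars))
pca_cor <- prcomp(mtcars, scale. = TRUE)

# proportion of variance explained by PC1 in each case: much higher
# for the covariance version
summary(pca_cov)$importance["Proportion of Variance", "PC1"]
summary(pca_cor)$importance["Proportion of Variance", "PC1"]
```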

PCA is used to find the number of components that provide an adequate summary of the dataset. We can use a scree plot to do this.

screeplot(pca_human_std, type = "l", main = "Scree plot for standardized human data")
abline(1,0, col='red', lty = 2)

Here the scree plot suggests that two components are enough to describe the data, which is actually the number of components shown in our biplots as well. In the biplot, the angle between the arrows indicates the correlation between the variables: Mater_Mort and Ado_Birth are quite strongly correlated with each other, and the same goes for Life_Exp, Exp_Edu, GNI_Capita and Edu2_ratio. This is what we already saw in the correlation plot at the beginning.

Tea Data

The tea data comes from the FactoMineR package and was collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what their perceptions of the product are (12 questions). In addition, some personal details were asked (4 questions). The description is here: https://cran.r-project.org/web/packages/FactoMineR/FactoMineR.pdf

Load the tea dataset and convert its character variables to factors

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

Explore the data

Structure:

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

Dimensions:

dim(tea)
## [1] 300  36

Browse the dataframe’s contents

View(tea)

Visualize the data

# Let's see how different tea drinkers perceive tea: do they think it's relaxing or sophisticated, and are tea drinkers friendly (I assume that is what friendliness means; I couldn't find a description)?

# First we define which columns we want to keep
keep_columns <- c("Tea", "How", "how", "sugar", "where", "age_Q", "relaxing", "sophisticated", "friendliness")

#create a new dataset with the variables we chose to keep
tea_percep <- select(tea, one_of(keep_columns))

#check what the dataframe looks like
summary(tea_percep)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                                                                     
##                   where       age_Q           relaxing  
##  chain store         :192   +60  :38   No.relaxing:113  
##  chain store+tea shop: 78   15-24:92   relaxing   :187  
##  tea shop            : 30   25-34:69                    
##                             35-44:40                    
##                             45-59:61                    
##            sophisticated           friendliness
##  Not.sophisticated: 85   friendliness    :242  
##  sophisticated    :215   Not.friendliness: 58  
##                                                
##                                                
## 
str(tea_percep)
## 'data.frame':    300 obs. of  9 variables:
##  $ Tea          : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How          : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how          : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar        : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where        : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ age_Q        : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ relaxing     : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ sophisticated: Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
#visualize the dataset
pivot_longer(tea_percep, cols = everything()) %>%
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + geom_bar() + theme(axis.text = element_text(angle = 35, hjust = 1, size = 10))

From the plots (which are clearer if you view them in RStudio) we can see that tea drinking is more common among young people than among the old. Most people connect tea drinking with friendliness; connecting it with sophistication is a little less common. Nearly two thirds seem to think that drinking tea is relaxing. Most people drink their tea as it is, without adding lemon, milk or anything else, although adding sugar divides the respondents into two almost equal-sized groups. The majority of the people surveyed buy their tea from chain stores, usually as tea bags, and most commonly it is Earl Grey.

Multiple Correspondence Analysis

# multiple correspondence analysis
tea_mca <- MCA(tea_percep, graph = FALSE)

# summary of the model
summary(tea_mca)
## 
## Call:
## MCA(X = tea_percep, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.219   0.180   0.171   0.145   0.140   0.123   0.119
## % of var.             11.596   9.548   9.076   7.695   7.397   6.490   6.295
## Cumulative % of var.  11.596  21.144  30.220  37.916  45.312  51.802  58.096
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.111   0.100   0.098   0.095   0.091   0.079   0.069
## % of var.              5.886   5.305   5.187   5.006   4.837   4.181   3.638
## Cumulative % of var.  63.982  69.287  74.474  79.480  84.318  88.499  92.137
##                       Dim.15  Dim.16  Dim.17
## Variance               0.064   0.048   0.036
## % of var.              3.404   2.529   1.931
## Cumulative % of var.  95.540  98.069 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.179  0.049  0.014 | -0.143  0.038  0.009 |  0.662
## 2                  |  0.007  0.000  0.000 | -0.205  0.078  0.018 |  1.197
## 3                  | -0.261  0.104  0.058 | -0.226  0.094  0.043 |  0.411
## 4                  | -0.668  0.679  0.368 | -0.321  0.191  0.085 | -0.070
## 5                  | -0.261  0.104  0.058 | -0.226  0.094  0.043 |  0.411
## 6                  | -0.655  0.652  0.298 | -0.242  0.108  0.041 |  0.289
## 7                  | -0.349  0.185  0.064 | -0.149  0.041  0.012 |  0.476
## 8                  |  0.083  0.010  0.004 |  0.163  0.049  0.014 |  0.611
## 9                  |  0.129  0.025  0.007 |  0.630  0.733  0.162 |  0.166
## 10                 |  0.265  0.107  0.032 |  0.806  1.200  0.301 |  0.227
##                       ctr   cos2  
## 1                   0.851  0.189 |
## 2                   2.785  0.604 |
## 3                   0.328  0.143 |
## 4                   0.010  0.004 |
## 5                   0.328  0.143 |
## 6                   0.163  0.058 |
## 7                   0.440  0.119 |
## 8                   0.725  0.200 |
## 9                   0.053  0.011 |
## 10                  0.100  0.024 |
## 
## Categories (the 10 first)
##                       Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test
## black              |  0.594  4.417  0.116  5.879 |  0.435  2.878  0.062  4.306
## Earl Grey          | -0.339  3.745  0.207 -7.867 |  0.013  0.006  0.000  0.294
## green              |  0.649  2.351  0.052  3.946 | -1.050  7.471  0.136 -6.383
## alone              | -0.055  0.100  0.006 -1.296 | -0.255  2.614  0.121 -6.020
## lemon              |  0.388  0.841  0.019  2.360 |  0.439  1.306  0.024  2.668
## milk               | -0.124  0.163  0.004 -1.101 |  0.184  0.436  0.009  1.636
## other              |  0.632  0.608  0.012  1.922 |  2.641 12.891  0.216  8.031
## tea bag            | -0.451  5.836  0.265 -8.910 | -0.277  2.675  0.100 -5.473
## tea bag+unpackaged |  0.140  0.313  0.009  1.638 |  0.818 12.911  0.305  9.552
## unpackaged         |  1.762 18.888  0.423 11.248 | -0.828  5.072  0.094 -5.289
##                       Dim.3    ctr   cos2 v.test  
## black              |  0.770  9.475  0.194  7.618 |
## Earl Grey          | -0.354  5.230  0.226 -8.225 |
## green              |  0.345  0.848  0.015  2.097 |
## alone              | -0.075  0.238  0.010 -1.770 |
## lemon              | -0.924  6.083  0.105 -5.615 |
## milk               |  0.541  3.988  0.078  4.826 |
## other              |  1.225  2.919  0.046  3.726 |
## tea bag            |  0.366  4.915  0.175  7.234 |
## tea bag+unpackaged | -0.432  3.787  0.085 -5.044 |
## unpackaged         | -0.600  2.799  0.049 -3.831 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.207 0.168 0.240 |
## How                | 0.034 0.280 0.204 |
## how                | 0.494 0.335 0.177 |
## sugar              | 0.140 0.004 0.185 |
## where              | 0.517 0.622 0.137 |
## age_Q              | 0.406 0.150 0.419 |
## relaxing           | 0.066 0.017 0.043 |
## sophisticated      | 0.097 0.006 0.047 |
## friendliness       | 0.011 0.041 0.090 |
# visualize MCA
plot(tea_mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")

Interpret the results

First of all, our model does not explain much: the first two dimensions cover just over 20% of the variation. This said, we can draw some conclusions (these might not be too clear from the knitted version; I used an enlarged version of the plot to see them more clearly):

  1. The age group 35-44 likes to add lemon to their tea
  2. The same group drinks black tea and adds lemon to it
  3. People 60 or over drink black tea
  4. It seems that unpackaged tea is usually bought from tea shops
  5. Tea bags are bought from chain stores
  6. People who drink Earl Grey are likely to add sugar to their tea

But these are just indicators, and by looking at the numbers more carefully we might find that these relationships are not very significant at all.
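One way to look at the numbers more carefully, assuming the tea_mca object fitted above: the eta2 table (also printed in the summary) shows how strongly each variable is tied to each dimension, and fviz_mca_var() from factoextra draws the category map colored by the quality of representation (cos2):

```r
library(factoextra)

# squared correlations between the variables and the MCA dimensions;
# values near 0 mean the variable contributes little to that dimension
tea_mca$var$eta2

# category map colored by cos2: categories with low cos2 are poorly
# represented on these two dimensions and should not be over-interpreted
fviz_mca_var(tea_mca, col.var = "cos2", repel = TRUE)
```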

Check the Screeplot

Just to visualize the percentages of inertia explained by each MCA dimension, we can, as our last step, use fviz_screeplot() from the factoextra package:

fviz_screeplot(tea_mca, addlabels = TRUE, ylim = c(0, 45))

As you can see, it is quite flat.

Just for fun, let's see it for the whole dataset. We set age as a supplementary quantitative variable, as it is a continuous number (integer) and not a factor. Then we set the perception questions and the questions that categorize the individuals as supplementary qualitative variables.

res.mca <- MCA(tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
fviz_screeplot(res.mca, addlabels = TRUE, ylim = c(0, 45))

As we can see, no single dimension explains much: the first 10 dimensions together explain just over 50% of the variation in the dataset.